ggml: add new member in GGML's internal data structure #2073
Conversation
We have a mechanism in place for using multiple backends simultaneously with ggml_backend_sched.
Thanks for your comment. I'll try ggml_backend_sched to see whether it can cover the above scenarios. Until I study its details, I think the method in this PR is clearer and more straightforward.
Currently it has some limitations that I suspect would prevent using it with backends that only implement a small subset of operations. However, it was designed from the first moment to support this case. There is work in progress in ggerganov/llama.cpp#6210 that uses ggml_backend_sched for this purpose.
OK, I see, but I think you might change your mind if you try/validate this PR on a Qualcomm SoC based phone. At the same time, I personally think the method in this PR reuses the existing code as much as possible; it is simple and clear and covers more scenarios in a straightforward way. ggml_backend_sched is an excellent idea, but its design might bring more complexity into GGML internals: the API designer/maintainer should not do work that belongs in a backend's implementation, and I think the existing, well-designed ggml_backend subsystem is enough at the moment. For example, when upper-layer code just wants to test GGML_OP_MUL_MAT with different backends (including the default GGML CPU backend, as in the sketch below), the excellent but complicated ggml_backend_sched might not be suitable for that scenario. Thanks for your comment.
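A minimal sketch of that "just test GGML_OP_MUL_MAT on one backend" scenario, using only the existing public ggml-backend API. The CPU backend is used here; a QNN backend handle from the PoC could be passed in the same way (its init function is not part of upstream ggml and is not shown):

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

static void test_mul_mat(ggml_backend_t backend) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true, // tensor data will live in a backend buffer
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b); // result: 2 x 3

    // allocate all tensors of the context in the selected backend
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    const float a_data[4 * 2] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    const float b_data[4 * 3] = { 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0 };
    ggml_backend_tensor_set(a, a_data, 0, sizeof(a_data));
    ggml_backend_tensor_set(b, b_data, 0, sizeof(b_data));

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_backend_graph_compute(backend, gf);

    float c_data[2 * 3] = { 0 };
    ggml_backend_tensor_get(c, c_data, 0, sizeof(c_data));

    ggml_backend_buffer_free(buf);
    ggml_free(ctx);
}

int main(void) {
    ggml_backend_t cpu = ggml_backend_cpu_init();
    test_mul_mat(cpu); // swap in another backend handle to test it the same way
    ggml_backend_free(cpu);
    return 0;
}
```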
Purpose
This PR serves multiple purposes:
(1) borrow some advantages from Qualcomm's QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK, a well-designed and concise SDK whose API matches GGML's existing design (including the existing backend design) very well. This PR borrows the "rank" member from the QNN SDK, reuses the existing code as much as possible, and brings no side effects or extra complexity to the existing code (see the sketch after this list).
(2) borrow some advantages from PyTorch: the user can specify whether a GGML op (such as mul_mat) is accelerated by a specific backend. This PR borrows the "use_hwaccel" member from PyTorch in the same spirit, again reusing existing code without side effects or extra complexity.
(3) cover more scenarios from upper-layer code (see section "Explanation"); this does not include using multiple backends simultaneously, which is a separate topic/scenario.
(4) (the main purpose) prepare to submit Qualcomm's QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) backend to the upstream GGML community (first in whisper.cpp, then in llama.cpp): PoC: Add Qualcomm mobile SoC native backend for GGML
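A minimal, self-contained sketch of the two proposed additions. The member names ("use_hwaccel", "rank") come from this PR; all other members are elided and the enum is simplified, so this mirrors the idea rather than the exact diff:

```c
#include <stdbool.h>
#include <stdint.h>

enum ggml_backend_type { // simplified stand-in for ggml's enum of the same name
    GGML_BACKEND_TYPE_CPU = 0,
    GGML_BACKEND_TYPE_GPU = 10,
};

struct ggml_context {
    /* ... existing members elided ... */
    bool use_hwaccel;  // (2) borrowed from PyTorch: ops created in this context
                       //     may be offloaded to a hardware-accelerated backend
};

struct ggml_tensor {
    enum ggml_backend_type backend; // existing member, intentionally kept by this PR
    /* ... existing members elided ... */
    int32_t rank;      // (1) borrowed from Qualcomm's QNN SDK: selects which
                       //     device (CPU/GPU/DSP) executes ops on this tensor
};
```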
Status
This PR has been verified/validated with whisper.cpp + the QNN backend on different Android phones (both a low-end and a high-end Qualcomm SoC based phone).
Explanation
This PR is useful for several scenarios.
In fact, the "gpu_device" member in struct whisper_context_params is semantically similar to use_hwaccel. There are 2 * n combinations here: 2 (use_gpu: true/false) * n (backend device: 1..n). Note that a DSP backend is also passed through "gpu_device", because we should reuse the existing code as much as possible and not bring side effects or complexity to it.
Accordingly, a special value of "gpu_device" can be taken to mean "no hardware acceleration", i.e. fall back to the original default GGML CPU backend, as sketched below.
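A minimal sketch of that 2 * n mapping. The reserved value and helper name are assumptions for illustration, not the PR's exact code:

```c
#include <stdbool.h>

// hypothetical reserved device id meaning "no hardware acceleration,
// fall back to the original default GGML CPU backend"
#define HWACCEL_DEVICE_NONE (-1)

// 2 (use_gpu: true/false) * n (gpu_device: device ids; a DSP backend is also
// passed through "gpu_device" so that the existing code paths are reused)
static bool should_use_hwaccel(bool use_gpu, int gpu_device) {
    return use_gpu && gpu_device != HWACCEL_DEVICE_NONE;
}
```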
Building on that, we can reuse the existing "backend" member in the core data structure together with a new "use_hwaccel" member in struct ggml_context. By the way, I personally think we should not remove the existing "backend" member from struct ggml_tensor, although there is a plan to remove it.
The reason is that there are some scenarios (again not including using multiple backends simultaneously, which is a separate topic/scenario) that "use_gpu & gpu_device" alone cannot cover.
I personally think these new members ("use_hwaccel" in struct ggml_context and "rank" in struct ggml_tensor) are not redundant, and they will NOT bring side effects to the existing code. Of course, I understand we should not bring too much extra code into GGML's internals and should keep GGML as compact and clean as possible, so this PR reuses existing code to the maximum extent, for example at the line linked below (a sketch of the idea follows):
https://github.com/zhouwg/whisper.cpp/blob/add_hwaccel_in_data_structure/ggml.c#L2995
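A minimal sketch of how the context flag could be consumed at tensor creation, reusing the sketch types above. Per the PR, the real hook point is inside ggml.c at the line linked above; this is an illustration of the idea, not the PR's verbatim code:

```c
static void ggml_tensor_apply_hwaccel(const struct ggml_context * ctx,
                                      struct ggml_tensor       * tensor) {
    if (ctx->use_hwaccel) {
        // route this tensor's ops to the accelerated backend; the concrete
        // device is then selected via the tensor's "rank" member
        tensor->backend = GGML_BACKEND_TYPE_GPU;
    }
}
```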
PR approval request
@hey-shashikant, thanks for your time and for approving #2054 (which is the same as this PR); could you help take a look at this PR? Thanks so much.
@slaren, I'm sorry to interrupt you; could you help take a look? Thanks.
@ggerganov, I'm sorry to interrupt you; could you help take a look? Thanks.